Framing the problem¶

The goal of this report is to provide real estate developers and investors with valuable insights to make informed decisions that can significantly impact economic growth, urban development, and community well-being.

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly as plotly
import plotly.express as px
plotly.offline.init_notebook_mode()

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

from imblearn.over_sampling import SMOTE
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

Getting the data¶

For this report, we will be using this dataset. Kaggle

A building permit is an official approval document issued by a governmental agency that allows you or your contractor to proceed with a construction or remodeling project on one's property. For more details go to https://www.thespruce.com/what-is-a-building-permit-1398344. Each city or county has its own office related to buildings, that can do multiple functions like issuing permits, inspecting buildings to enforce safety measures, modifying rules to accommodate needs of the growing population etc. For the city of San Francisco, permit issuing is taken care by www.sfdbi.org/

Why is this important: In the recent past, several posts and blogs highlighted that main discrepancy in demand and supply in real estate industry is due to delays in issuing building permits. Refer: https://www.trulia.com/blog/trends/elasticity-2016/ - Introduces concept of elasticity, and nice scatterplot of various cities. A good data story! https://biv.com/article/2014/11/city-building-permit-delays-costing-developers-tim

Content The data was downloaded for the dates ranging from Jan 1st, 2013 to Feb 25th, 2018 using the filter in San Francisco open data portal. This is the exact link: https://data.sfgov.org/Housing-and-Buildings/Building-Permits/i98e-djp9/data There are 43 columns and close to 200k records in the downloaded version (kept here). Description is separately uploaded as dictionary.

In [ ]:
df = pd.read_csv('./data/Building_Permits.csv')
df.head(5)
C:\Users\Owner\AppData\Local\Temp\ipykernel_55036\855207747.py:1: DtypeWarning:

Columns (22,32) have mixed types. Specify dtype option on import or set low_memory=False.

Out[ ]:
Permit Number Permit Type Permit Type Definition Permit Creation Date Block Lot Street Number Street Number Suffix Street Name Street Suffix ... Existing Construction Type Existing Construction Type Description Proposed Construction Type Proposed Construction Type Description Site Permit Supervisor District Neighborhoods - Analysis Boundaries Zipcode Location Record ID
0 201505065519 4 sign - erect 05/06/2015 0326 023 140 NaN Ellis St ... 3.0 constr type 3 NaN NaN NaN 3.0 Tenderloin 94102.0 (37.785719256680785, -122.40852313194863) 1380611233945
1 201604195146 4 sign - erect 04/19/2016 0306 007 440 NaN Geary St ... 3.0 constr type 3 NaN NaN NaN 3.0 Tenderloin 94102.0 (37.78733980600732, -122.41063199757738) 1420164406718
2 201605278609 3 additions alterations or repairs 05/27/2016 0595 203 1647 NaN Pacific Av ... 1.0 constr type 1 1.0 constr type 1 NaN 3.0 Russian Hill 94109.0 (37.7946573324287, -122.42232562979227) 1424856504716
3 201611072166 8 otc alterations permit 11/07/2016 0156 011 1230 NaN Pacific Av ... 5.0 wood frame (5) 5.0 wood frame (5) NaN 3.0 Nob Hill 94109.0 (37.79595867909168, -122.41557405519474) 1443574295566
4 201611283529 6 demolitions 11/28/2016 0342 001 950 NaN Market St ... 3.0 constr type 3 NaN NaN NaN 6.0 Tenderloin 94102.0 (37.78315261897309, -122.40950883997789) 144548169992

5 rows × 43 columns

We can see a description of the columns below.

In [ ]:
desc = pd.read_excel('./data/DataDictionaryBuildingPermit.xlsx')
desc
Out[ ]:
Sl No Column name Description
0 1 Permit Number Number assigned while filing
1 2 Permit Type Type of the permit represented numerically.
2 3 Permit Type Definition Description of the Permit type, for example\n ...
3 4 Permit Creation Date Date on which permit created, later than \nor ...
4 5 Block Related to address
5 6 Lot Related to address
6 7 Street Number Related to address
7 8 Street Number Suffix Related to address
8 9 Street Name Related to address
9 10 Street Name Suffix Related to address
10 11 Unit Unit of a building
11 12 Unit suffix Suffix if any, for the unit
12 13 Description Details about purpose of the permit.\n Example...
13 14 Current Status Current status of the permit application.
14 15 Current Status Date Date at which current status was entered
15 16 Filed Date Filed date for the permit
16 17 Issued Date Issued date for the permit
17 18 Completed Date The date on which project was completed, \napp...
18 19 First Construction Document Date Date on which construction was documented
19 20 Structural Notification Notification to meet some legal need, given or...
20 21 Number of Existing Stories Number of existing stories in the building. \n...
21 22 Number of Proposed Stories Number of proposed stories for the constructio...
22 23 Voluntary Soft-Story \nRetrofit Soft story to meet earth quake regulations
23 24 Fire Only Permit Fire hazard prevention related permit
24 25 Permit Expiration Date Expiration date related to issued permit.
25 26 Estimated Cost Initial estimation of the cost of the project
26 27 Revised Cost Revised estimation of the cost of the project
27 28 Existing Use Existing use of the building
28 29 Existing Units Existing number of units
29 30 Proposed Use Proposed use of the building
30 31 Proposed Units Proposed number of units
31 32 Plansets Plan representation indicating the general des...
32 33 TIDF Compliance TIDF compliant or not, this is a new legal req...
33 34 Existing Construction Type Construction type, existing,as categories \nre...
34 35 Existing Construction Type Description Description of the above, for example, \nwood ...
35 36 Proposed Construction Type Construction type, proposed, as categories\n r...
36 37 Proposed Construction Type Description Description of the above
37 38 Site Permit Permit for site
38 39 Supervisor District Supervisor District to which the building loca...
39 40 Neighborhoods - Analysis Boundaries Neighborhood to which the building location be...
40 41 Zipcode Zipcode of building address
41 42 Location Location in latitude, longitude pair.
42 43 Record ID Some ID, not useful for this

Data Preparation¶

Let us see the state of our dataset.

In [ ]:
df.shape
Out[ ]:
(198900, 43)
In [ ]:
df.isnull().sum()
Out[ ]:
Permit Number                                  0
Permit Type                                    0
Permit Type Definition                         0
Permit Creation Date                           0
Block                                          0
Lot                                            0
Street Number                                  0
Street Number Suffix                      196684
Street Name                                    0
Street Suffix                               2768
Unit                                      169421
Unit Suffix                               196939
Description                                  290
Current Status                                 0
Current Status Date                            0
Filed Date                                     0
Issued Date                                14940
Completed Date                            101709
First Construction Document Date           14946
Structural Notification                   191978
Number of Existing Stories                 42784
Number of Proposed Stories                 42868
Voluntary Soft-Story Retrofit             198865
Fire Only Permit                          180073
Permit Expiration Date                     51880
Estimated Cost                             38066
Revised Cost                                6066
Existing Use                               41114
Existing Units                             51538
Proposed Use                               42439
Proposed Units                             50911
Plansets                                   37309
TIDF Compliance                           198898
Existing Construction Type                 43366
Existing Construction Type Description     43366
Proposed Construction Type                 43162
Proposed Construction Type Description     43162
Site Permit                               193541
Supervisor District                         1717
Neighborhoods - Analysis Boundaries         1725
Zipcode                                     1716
Location                                    1700
Record ID                                      0
dtype: int64

That is a lot of missing values.

Let us start by converting any date column to the actual datetime format.

In [ ]:
date_columns = ['Permit Creation Date', 'Current Status Date', 'Filed Date', 'Issued Date',
                 'Completed Date', 'First Construction Document Date', 'Permit Expiration Date']

df[date_columns] = df[date_columns].astype('datetime64[ns]')

For this report, we will try to analyze and predict how long it takes to get a permit for the building.

To do this, we will also need to create a new column in the dataset, called days_to_issue.

In [ ]:
df['Days to Issue'] = (df['Issued Date'] - df['Filed Date']).dt.days
In [ ]:
fire_only_permit_counts = df['Fire Only Permit'].value_counts()
structural_notification_counts = df['Structural Notification'].value_counts()
site_permit_counts = df['Site Permit'].value_counts()

print(fire_only_permit_counts)
print(structural_notification_counts)
print(site_permit_counts)
Fire Only Permit
Y    18827
Name: count, dtype: int64
Structural Notification
Y    6922
Name: count, dtype: int64
Site Permit
Y    5359
Name: count, dtype: int64

We can see that in the columns above, there is a Y value count in some of the rows and the rest of the rows are empty, we will fill this empty rows with N or better still, change the values to True or False.

In [ ]:
df['Fire Only Permit'] = df['Fire Only Permit'].map({'Y': True, np.nan: False})
df['Structural Notification'] = df['Structural Notification'].map({'Y': True, np.nan: False})
df['Site Permit'] = df['Site Permit'].map({'Y': True, np.nan: False})

print(df['Fire Only Permit'].value_counts())
print(df['Structural Notification'].value_counts())
print(df['Site Permit'].value_counts())
Fire Only Permit
False    180073
True      18827
Name: count, dtype: int64
Structural Notification
False    191978
True       6922
Name: count, dtype: int64
Site Permit
False    193541
True       5359
Name: count, dtype: int64

Some of the columns are empty or almost empty so there is no way to remedy them so we will be dropping those columns.

In [ ]:
columns_to_drop = ['TIDF Compliance', 'Voluntary Soft-Story Retrofit']
df.drop(columns=columns_to_drop, inplace=True)

Let us see how the dataset looks now with the preliminary cleaning done.

In [ ]:
missing_values = df.isnull().sum()
missing_values = missing_values.sort_values(ascending=False)

plt.figure(figsize=(12, 6))
sns.barplot(x=missing_values.index, y=missing_values.values, hue=missing_values.index, palette="viridis", dodge=False, legend=False)
plt.xticks(rotation=90)
plt.title('Missing Values')
plt.xlabel('Columns')
plt.ylabel('Number of Missing Values')
plt.show()
No description has been provided for this image
In [ ]:
missing_values = df.isnull().sum()
missing_values = missing_values.sort_values(ascending=False)

# Convert the missing values to a DataFrame for Plotly
missing_values_df = missing_values.reset_index()
missing_values_df.columns = ['Columns', 'Number of Missing Values']

# Create a bar plot using Plotly
fig = px.bar(missing_values_df, x='Columns', y='Number of Missing Values', 
             title='Missing Values', 
             labels={'Columns': 'Columns', 'Number of Missing Values': 'Number of Missing Values'}, 
             color='Columns')

fig.update_layout(xaxis={'categoryorder':'total descending'}, xaxis_tickangle=-90, height=1000)
fig.show()

We can see that the Unit suffix, Street Number Suffix and Unit contain the most missing values but they aren't really relevant because not all houses or buildings have one and they don't affect the permit issuance.

In [ ]:
# calculate the percentage of missing values for each column
missing_values = df.isnull().sum()
missing_values = missing_values[missing_values > 0]
missing_values = missing_values / df.shape[0] * 100
missing_values = missing_values.sort_values(ascending=False)

missing_values
Out[ ]:
Unit Suffix                               99.014077
Street Number Suffix                      98.885872
Unit                                      85.178984
Completed Date                            51.135747
Permit Expiration Date                    26.083459
Existing Units                            25.911513
Proposed Units                            25.596280
Existing Construction Type Description    21.802916
Existing Construction Type                21.802916
Proposed Construction Type Description    21.700352
Proposed Construction Type                21.700352
Number of Proposed Stories                21.552539
Number of Existing Stories                21.510307
Proposed Use                              21.336853
Existing Use                              20.670689
Estimated Cost                            19.138260
Plansets                                  18.757667
First Construction Document Date           7.514329
Issued Date                                7.511312
Days to Issue                              7.511312
Revised Cost                               3.049774
Street Suffix                              1.391654
Neighborhoods - Analysis Boundaries        0.867270
Supervisor District                        0.863248
Zipcode                                    0.862745
Location                                   0.854701
Description                                0.145802
dtype: float64
In [ ]:
# drop the following columns unit suffix, street suffix, street number suffix, and zip code
columns_to_drop = ['Unit Suffix', 'Unit', 'Street Suffix', 'Street Number Suffix']
df.drop(columns=columns_to_drop, inplace=True)
df
Out[ ]:
Permit Number Permit Type Permit Type Definition Permit Creation Date Block Lot Street Number Street Name Description Current Status ... Existing Construction Type Description Proposed Construction Type Proposed Construction Type Description Site Permit Supervisor District Neighborhoods - Analysis Boundaries Zipcode Location Record ID Days to Issue
0 201505065519 4 sign - erect 2015-05-06 0326 023 140 Ellis ground fl facade: to erect illuminated, electr... expired ... constr type 3 NaN NaN False 3.0 Tenderloin 94102.0 (37.785719256680785, -122.40852313194863) 1380611233945 187.0
1 201604195146 4 sign - erect 2016-04-19 0306 007 440 Geary remove (e) awning and associated signs. issued ... constr type 3 NaN NaN False 3.0 Tenderloin 94102.0 (37.78733980600732, -122.41063199757738) 1420164406718 471.0
2 201605278609 3 additions alterations or repairs 2016-05-27 0595 203 1647 Pacific installation of separating wall withdrawn ... constr type 1 1.0 constr type 1 False 3.0 Russian Hill 94109.0 (37.7946573324287, -122.42232562979227) 1424856504716 NaN
3 201611072166 8 otc alterations permit 2016-11-07 0156 011 1230 Pacific repair dryrot & stucco at front of bldg. complete ... wood frame (5) 5.0 wood frame (5) False 3.0 Nob Hill 94109.0 (37.79595867909168, -122.41557405519474) 1443574295566 253.0
4 201611283529 6 demolitions 2016-11-28 0342 001 950 Market demolish retail/office/commercial 3-story buil... issued ... constr type 3 NaN NaN False 6.0 Tenderloin 94102.0 (37.78315261897309, -122.40950883997789) 144548169992 368.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
198895 M862628 8 otc alterations permit 2017-12-05 0113 017A 1228 Montgomery street space issued ... NaN NaN NaN False NaN NaN NaN NaN 1489337276729 0.0
198896 201712055595 8 otc alterations permit 2017-12-05 0271 014 580 Bush fire alarm upgrade ref 201704123852 issued ... wood frame (5) 5.0 wood frame (5) False NaN NaN NaN NaN 1489462354993 1.0
198897 M863507 8 otc alterations permit 2017-12-06 4318 019 1568 Indiana street space issued ... NaN NaN NaN False NaN NaN NaN NaN 1489539379952 0.0
198898 M863747 8 otc alterations permit 2017-12-06 0298 029 795 Sutter street space permit issued ... NaN NaN NaN False NaN NaN NaN NaN 1489608233656 0.0
198899 M864287 8 otc alterations permit 2017-12-07 0160 006 838 Pacific street space permit issued ... NaN NaN NaN False NaN NaN NaN NaN 1489796283803 0.0

198900 rows × 38 columns

In [ ]:
df.isnull().sum()
Out[ ]:
Permit Number                                  0
Permit Type                                    0
Permit Type Definition                         0
Permit Creation Date                           0
Block                                          0
Lot                                            0
Street Number                                  0
Street Name                                    0
Description                                  290
Current Status                                 0
Current Status Date                            0
Filed Date                                     0
Issued Date                                14940
Completed Date                            101709
First Construction Document Date           14946
Structural Notification                        0
Number of Existing Stories                 42784
Number of Proposed Stories                 42868
Fire Only Permit                               0
Permit Expiration Date                     51880
Estimated Cost                             38066
Revised Cost                                6066
Existing Use                               41114
Existing Units                             51538
Proposed Use                               42439
Proposed Units                             50911
Plansets                                   37309
Existing Construction Type                 43366
Existing Construction Type Description     43366
Proposed Construction Type                 43162
Proposed Construction Type Description     43162
Site Permit                                    0
Supervisor District                         1717
Neighborhoods - Analysis Boundaries         1725
Zipcode                                     1716
Location                                    1700
Record ID                                      0
Days to Issue                              14940
dtype: int64
In [ ]:
#get me the rows with missing values in the issued date column
missing_issued_date = df[df['Issued Date'].isnull()]
In [ ]:
# show the distribution of the current status column
plt.figure(figsize=(12, 6))
sns.countplot(data=missing_issued_date, x='Current Status', palette='viridis')
plt.xticks(rotation=45)
plt.title('Current Status Distribution')
plt.xlabel('Current Status')
plt.ylabel('Count')
plt.show()
C:\Users\Owner\AppData\Local\Temp\ipykernel_55036\385284598.py:3: FutureWarning:



Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.


No description has been provided for this image
In [ ]:
# distribution of current status column
plt.figure(figsize=(12, 6))
sns.countplot(data=df, x='Current Status', palette='viridis')
plt.xticks(rotation=45)
plt.title('Current Status Distribution')
plt.xlabel('Current Status')
plt.ylabel('Count')
plt.show()
C:\Users\Owner\AppData\Local\Temp\ipykernel_55036\2324579989.py:3: FutureWarning:



Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.


No description has been provided for this image
In [ ]:
# unique count of the current status column
df['Current Status'].value_counts()
Out[ ]:
Current Status
complete       97077
issued         83559
filed          12043
withdrawn       1754
cancelled       1536
expired         1370
approved         733
reinstated       563
suspend          193
revoked           50
plancheck         16
appeal             2
disapproved        2
incomplete         2
Name: count, dtype: int64
In [ ]:
# give me a dataframe with current status = cancelled or suspend or revoked
cancelled_suspended_revoked = df[(df['Current Status'] == 'cancelled') | (df['Current Status'] == 'suspend') | (df['Current Status'] == 'revoked')]
cancelled_suspended_revoked.head(5)
Out[ ]:
Permit Number Permit Type Permit Type Definition Permit Creation Date Block Lot Street Number Street Name Description Current Status ... Existing Construction Type Description Proposed Construction Type Proposed Construction Type Description Site Permit Supervisor District Neighborhoods - Analysis Boundaries Zipcode Location Record ID Days to Issue
90 201303222819 7 wall or painted sign 2013-03-22 0719 002 901 Van Ness to erect electric, projecting, double faced si... cancelled ... NaN NaN NaN False 5.0 Western Addition 94109.0 (37.78400192377985, -122.42169433153967) 129957380462 NaN
93 201501206148 3 additions alterations or repairs 2015-01-20 1226 019 418 Shrader rear structure - converting structure at the r... cancelled ... wood frame (5) 5.0 wood frame (5) False 5.0 Haight Ashbury 94117.0 (37.770762127562946, -122.45184046838122) 1370343502813 NaN
100 201603313579 3 additions alterations or repairs 2016-03-31 1289 020 1591 Shrader building addition deck & balcony at (e) 3-stor... cancelled ... wood frame (5) 5.0 wood frame (5) True 5.0 Inner Sunset 94117.0 (37.760086498456246, -122.45055881787229) 141801095518 NaN
125 201705035616 8 otc alterations permit 2017-05-03 4150 021 1061 Florida reroofing cancelled ... wood frame (5) 5.0 wood frame (5) False 9.0 Mission 94110.0 (37.754907619215444, -122.40999547082743) 1461797171313 0.0
571 201301027105 8 otc alterations permit 2013-01-02 1173 007 40 Loyola upgrade existing furnace room by installing 5/... cancelled ... wood frame (5) 5.0 wood frame (5) False 1.0 Lone Mountain/USF 94117.0 (37.776229196787504, -122.44877581259863) 129220091307 0.0

5 rows × 38 columns

In [ ]:
# show the construction type distribution
plt.figure(figsize=(12, 6))
sns.countplot(data=cancelled_suspended_revoked, x='Permit Type Definition', palette='viridis')
plt.xticks(rotation=45)
plt.title('Construction Type Distribution')
plt.xlabel('Construction Type')
plt.ylabel('Count')
plt.show()
C:\Users\Owner\AppData\Local\Temp\ipykernel_55036\3381986286.py:3: FutureWarning:



Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.


No description has been provided for this image

Our most important columns here seem to be Issued Date and Days to Issue, which 7% of their values are missing. We will remove all the rows with missing values from these columns.

In [ ]:
# remove rows with missing values in Issued Date and Days to Issue columns
df.dropna(subset=['Issued Date', 'Days to Issue'], inplace=True)

df.isnull().sum()
Out[ ]:
Permit Number                                 0
Permit Type                                   0
Permit Type Definition                        0
Permit Creation Date                          0
Block                                         0
Lot                                           0
Street Number                                 0
Street Name                                   0
Description                                  86
Current Status                                0
Current Status Date                           0
Filed Date                                    0
Issued Date                                   0
Completed Date                            86788
First Construction Document Date             32
Structural Notification                       0
Number of Existing Stories                40757
Number of Proposed Stories                40822
Fire Only Permit                              0
Permit Expiration Date                    36960
Estimated Cost                            37065
Revised Cost                                  9
Existing Use                              39187
Existing Units                            48685
Proposed Use                              40414
Proposed Units                            48197
Plansets                                  36931
Existing Construction Type                41292
Existing Construction Type Description    41292
Proposed Construction Type                41084
Proposed Construction Type Description    41084
Site Permit                                   0
Supervisor District                        1657
Neighborhoods - Analysis Boundaries        1664
Zipcode                                    1656
Location                                   1641
Record ID                                     0
Days to Issue                                 0
dtype: int64

For the rest columns, we will replace with the mode by grouping the columns.

In [ ]:
import pandas as pd

def fill_missing_values_all_columns(df, group_columns):
    df_copy = df.copy()

    def fill_missing(x):
        mode_values = x.mode()
        filled_x = x.fillna(mode_values[0] if not mode_values.empty else x)
        return filled_x.infer_objects(copy=False)

    for target_column in df_copy.columns:
        if target_column not in group_columns:
            df_copy[target_column] = df_copy.groupby(group_columns)[target_column].transform(fill_missing)

    return df_copy

# Optionally, set the Pandas option to opt-in to the future behavior
pd.set_option('future.no_silent_downcasting', True)

group_columns = ['Street Name', 'Block']
df_filled = fill_missing_values_all_columns(df, group_columns)
missing_values = df_filled.isnull().sum()
missing_values = missing_values.sort_values(ascending=False)

print(missing_values)
Completed Date                            4368
Existing Units                            1252
Existing Construction Type                1204
Existing Construction Type Description    1204
Proposed Units                            1145
Number of Existing Stories                1121
Proposed Construction Type Description    1035
Proposed Construction Type                1035
Number of Proposed Stories                 982
Existing Use                               957
Proposed Use                               905
Estimated Cost                             773
Permit Expiration Date                     772
Plansets                                   772
Neighborhoods - Analysis Boundaries        101
Zipcode                                     94
Supervisor District                         92
Location                                    89
Description                                  2
First Construction Document Date             1
Site Permit                                  0
Record ID                                    0
Permit Number                                0
Revised Cost                                 0
Permit Type                                  0
Fire Only Permit                             0
Structural Notification                      0
Issued Date                                  0
Filed Date                                   0
Current Status Date                          0
Current Status                               0
Street Name                                  0
Street Number                                0
Lot                                          0
Block                                        0
Permit Creation Date                         0
Permit Type Definition                       0
Days to Issue                                0
dtype: int64
In [ ]:
df_filled.isnull().sum()
Out[ ]:
Permit Number                                0
Permit Type                                  0
Permit Type Definition                       0
Permit Creation Date                         0
Block                                        0
Lot                                          0
Street Number                                0
Street Name                                  0
Description                                  2
Current Status                               0
Current Status Date                          0
Filed Date                                   0
Issued Date                                  0
Completed Date                            4368
First Construction Document Date             1
Structural Notification                      0
Number of Existing Stories                1121
Number of Proposed Stories                 982
Fire Only Permit                             0
Permit Expiration Date                     772
Estimated Cost                             773
Revised Cost                                 0
Existing Use                               957
Existing Units                            1252
Proposed Use                               905
Proposed Units                            1145
Plansets                                   772
Existing Construction Type                1204
Existing Construction Type Description    1204
Proposed Construction Type                1035
Proposed Construction Type Description    1035
Site Permit                                  0
Supervisor District                         92
Neighborhoods - Analysis Boundaries        101
Zipcode                                     94
Location                                    89
Record ID                                    0
Days to Issue                                0
dtype: int64
In [ ]:
df = df_filled.copy()

Data Analysis¶

Permit Analysis¶

In [ ]:
permit_type_def_counts = df['Permit Type Definition'].value_counts()
print(permit_type_def_counts)
Permit Type Definition
otc alterations permit                 170524
additions alterations or repairs         9439
sign - erect                             2403
new construction wood frame               601
demolitions                               377
wall or painted sign                      362
new construction                          176
grade or quarry or fill or excavate        78
Name: count, dtype: int64

Permit Type Definitions:

OTC Alterations Permit: Minor alterations or repairs processed quickly.
Additions Alterations or Repairs: Permits for additions, alterations, or repairs.
Sign - Erect: Permits for installation or erection of signs.
New Construction Wood Frame: Permits for new buildings with wood frame structure.
Demolitions: Permits for demolition of existing structures.
Wall or Painted Sign: Permits for wall signs or painted signs.
New Construction: Permits for entirely new construction projects.
Grade or Quarry or Fill or Excavate: Permits for grading, quarrying, filling, or excavating land.

In [ ]:
plt.figure(figsize=(12, 6))
count_plot = sns.countplot(x='Permit Type Definition', data=df, order=df['Permit Type Definition'].value_counts().index)
plt.title('Permit Type Definition Counts')
plt.xlabel('Permit Type Definition')
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right')

for p in count_plot.patches:
    count_plot.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()),
                        ha='center', va='center', xytext=(0, 10), textcoords='offset points')
plt.show()
No description has been provided for this image
In [ ]:
# Create a DataFrame with the counts
permit_counts = df['Permit Type Definition'].value_counts().reset_index()
permit_counts.columns = ['Permit Type Definition', 'Count']

# Create a bar plot using Plotly
fig = px.bar(permit_counts, x='Permit Type Definition', y='Count', 
             title='Permit Type Definition Counts', 
             labels={'Permit Type Definition': 'Permit Type Definition', 'Count': 'Count'},
             text='Count')

# Update layout for better visualization
fig.update_layout(xaxis_tickangle=-45, height=1000)
fig.update_traces(textposition='outside')

# Show the plot
fig.show()
In [ ]:
# Calculate the mean days to issue for each permit type
mean_days_permit_type = df.groupby('Permit Type Definition')['Days to Issue'].mean().reset_index()
mean_days_permit_type = mean_days_permit_type.sort_values(by='Days to Issue')

# Create a bar plot using Plotly
fig = px.bar(mean_days_permit_type, x='Permit Type Definition', y='Days to Issue',
             title='Average Days to Issue for Each Permit Type',
             labels={'Permit Type Definition': 'Permit Type Definition', 'Days to Issue': 'Average Days to Issue'},
             text='Days to Issue',
             color='Days to Issue')

# Update layout for better visualization
fig.update_layout(xaxis_tickangle=-45, height=800)
fig.update_traces(texttemplate='%{text:.2f}', textposition='outside')

# Show the plot
fig.show()
In [ ]:
permit_type_counts = df['Permit Type Definition'].value_counts().reset_index()
permit_type_counts.columns = ['Permit Type Definition', 'Count']

mean_days_permit_type = df.groupby('Permit Type Definition')['Days to Issue'].mean().reset_index()
mean_days_permit_type = mean_days_permit_type.sort_values(by='Days to Issue')
mean_days_permit_type.columns = ['Permit Type Definition', 'Avg Days to Issue']

result_df = pd.merge(permit_type_counts, mean_days_permit_type, on='Permit Type Definition')
display(result_df)
Permit Type Definition Count Avg Days to Issue
0 otc alterations permit 170524 10.870757
1 additions alterations or repairs 9439 247.252251
2 sign - erect 2403 52.724511
3 new construction wood frame 601 398.143095
4 demolitions 377 348.175066
5 wall or painted sign 362 45.348066
6 new construction 176 478.602273
7 grade or quarry or fill or excavate 78 97.307692

Days to Visit¶

How we can understand, which day of the week is the best for Department of Building Inspections visit?

We can look at how many days it took for considering Permit, depending on which day of the week the Permit was Filed

In [ ]:
# Extract the weekday from the 'Filed Date' column
df['Weekday'] = df['Filed Date'].dt.day_name()

# Calculate the average days to issue for each weekday
average_days_per_weekday = df.groupby('Weekday')['Days to Issue'].mean().reset_index()

# Define the order of the weekdays
weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Sort the DataFrame by the defined order
average_days_per_weekday['Weekday'] = pd.Categorical(average_days_per_weekday['Weekday'], categories=weekday_order, ordered=True)
average_days_per_weekday = average_days_per_weekday.sort_values('Weekday')

# Create a bar plot using Plotly
fig = px.bar(average_days_per_weekday, x='Weekday', y='Days to Issue', 
             title='Average Days to Issue for Each Day of the Week',
             labels={'Weekday': 'Day of the Week', 'Days to Issue': 'Average Days to Issue'},
             text='Days to Issue',
             color='Weekday',
             category_orders={'Weekday': weekday_order})

# Update layout for better visualization
fig.update_layout(xaxis_tickangle=-45, height=800)
fig.update_traces(texttemplate='%{text:.2f}', textposition='outside')

# Show the plot
fig.show()

Further Analysis¶

Let us look

In [ ]:
numeric_columns = df.select_dtypes(include=[np.number])
correlation_matrix = numeric_columns.corr()

plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=.5)
plt.title('Correlation Matrix')
plt.show()
No description has been provided for this image
In [ ]:
# Select relevant columns

df_rel = df[['Neighborhoods - Analysis Boundaries', 'Existing Construction Type Description', 'Permit Type']]
In [ ]:
# Create a Heatmap for Neighborhood vs Construction Type vs Permit Type

plt.figure(figsize=(15, 8))
heatmap_data = pd.crosstab(index=df_rel['Neighborhoods - Analysis Boundaries'], 
                           columns=df_rel['Existing Construction Type Description'], 
                           values=df_rel['Permit Type'], 
                           aggfunc='count', 
                           normalize='index')

sns.heatmap(heatmap_data, cmap='YlGnBu', annot=True, fmt=".2f", cbar_kws={'label': 'Percentage of Permits'})
plt.title('Neighborhoods vs Construction Types vs Permit Types')
plt.xlabel('Construction Types')
plt.ylabel('Neighborhoods')
plt.show()
No description has been provided for this image
In [ ]:
df['Permit Type'].value_counts()
Out[ ]:
Permit Type
8    170524
3      9439
4      2403
2       601
6       377
7       362
1       176
5        78
Name: count, dtype: int64

Prediction Model¶

Random Forest Regressor¶

In [ ]:
#datatype of the permit type column
df['Permit Type']
Out[ ]:
dtype('int64')
In [ ]:
from sklearn.ensemble import RandomForestRegressor

# Preparation and Modeling

df['Day_of_Week'] = df['Filed Date'].dt.dayofweek
df['Month'] = df['Filed Date'].dt.month

selected_features = [
    'Number of Existing Stories', 'Estimated Cost', 'Day_of_Week', 'Month',
    'Existing Construction Type', 'Revised Cost', 'Existing Units', 'Permit Type'
]

df_model = df[selected_features + ['Days to Issue']].dropna()
X = df_model.drop('Days to Issue', axis=1)
y = df_model['Days to Issue']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)

model_days_to_issue = RandomForestRegressor(random_state=40)
model_days_to_issue.fit(X_train, y_train)
Out[ ]:
RandomForestRegressor(random_state=40)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestRegressor(random_state=40)
In [ ]:
# Training and Evaluation

from sklearn.metrics import mean_squared_error, r2_score

y_pred_train = model_days_to_issue.predict(X_train)
y_pred_test = model_days_to_issue.predict(X_test)

mse_train = mean_squared_error(y_train, y_pred_train)
r2_train = r2_score(y_train, y_pred_train)
mse_test = mean_squared_error(y_test, y_pred_test)
r2_test = r2_score(y_test, y_pred_test)

print(f'Training Set - Mean Squared Error: {mse_train}, R-squared: {r2_train}')
print(f'Testing Set - Mean Squared Error: {mse_test}, R-squared: {r2_test}')
Training Set - Mean Squared Error: 648.598706950446, R-squared: 0.9208572781227299
Testing Set - Mean Squared Error: 3549.8507175360546, R-squared: 0.5613536531588268

Gradient Boosting Regressor¶

In [ ]:
from sklearn.ensemble import GradientBoostingRegressor

model_gb = GradientBoostingRegressor(random_state=40)
model_gb.fit(X_train, y_train)

y_pred_train_gb = model_gb.predict(X_train)
y_pred_test_gb = model_gb.predict(X_test)

mse_train_gb = mean_squared_error(y_train, y_pred_train_gb)
r2_train_gb = r2_score(y_train, y_pred_train_gb)
mse_test_gb = mean_squared_error(y_test, y_pred_test_gb)
r2_test_gb = r2_score(y_test, y_pred_test_gb)

print(f'Gradient Boosting - Training Set - Mean Squared Error: {mse_train_gb}, R-squared: {r2_train_gb}')
print(f'Gradient Boosting - Testing Set - Mean Squared Error: {mse_test_gb}, R-squared: {r2_test_gb}')
Gradient Boosting - Training Set - Mean Squared Error: 3911.5567887013053, R-squared: 0.5227075729292396
Gradient Boosting - Testing Set - Mean Squared Error: 4013.9157931916643, R-squared: 0.5040102699209544

Support Vector Regressor (SVR)¶

In [ ]:
from sklearn.svm import SVR

model_svr = SVR(kernel='rbf')
model_svr.fit(X_train, y_train)

y_pred_train_svr = model_svr.predict(X_train)
y_pred_test_svr = model_svr.predict(X_test)

mse_train_svr = mean_squared_error(y_train, y_pred_train_svr)
r2_train_svr = r2_score(y_train, y_pred_train_svr)
mse_test_svr = mean_squared_error(y_test, y_pred_test_svr)
r2_test_svr = r2_score(y_test, y_pred_test_svr)

print(f'SVR - Training Set - Mean Squared Error: {mse_train_svr}, R-squared: {r2_train_svr}')
print(f'SVR - Testing Set - Mean Squared Error: {mse_test_svr}, R-squared: {r2_test_svr}')
SVR - Training Set - Mean Squared Error: 8116.616614317737, R-squared: 0.009601585066902452
SVR - Testing Set - Mean Squared Error: 8038.314680830896, R-squared: 0.00672516957174496

XGBoost Regressor¶

In [ ]:
import xgboost as xgb

model_xgb = xgb.XGBRegressor(random_state=40)
model_xgb.fit(X_train, y_train)

y_pred_train_xgb = model_xgb.predict(X_train)
y_pred_test_xgb = model_xgb.predict(X_test)

mse_train_xgb = mean_squared_error(y_train, y_pred_train_xgb)
r2_train_xgb = r2_score(y_train, y_pred_train_xgb)
mse_test_xgb = mean_squared_error(y_test, y_pred_test_xgb)
r2_test_xgb = r2_score(y_test, y_pred_test_xgb)

print(f'XGBoost - Training Set - Mean Squared Error: {mse_train_xgb}, R-squared: {r2_train_xgb}')
print(f'XGBoost - Testing Set - Mean Squared Error: {mse_test_xgb}, R-squared: {r2_test_xgb}')
XGBoost - Training Set - Mean Squared Error: 2764.717153027853, R-squared: 0.6626461965362527
XGBoost - Testing Set - Mean Squared Error: 3746.257006247557, R-squared: 0.5370842097666461

Conclusion¶

While we were able to get some alaysis from this dataset that will guide businesses in making decisions. The prediction on how long it takes a permit to get approved still needs some work.